NON-RANDOM  MISSING  DATA


by  Jerry  A.  Hausman  and  A.  Michael  Spence 


Missing  data  is  an  important  problem  in  applied  statistics  and  econo- 
metrics.  It  is  well  known  that  in  a  sample  from  a  multivariate  distribution 
if  each  observation  is  discarded  for  which  one  or  more  entries  is  missing, 
statistics  computed  from  the  reduced  sample  may  not  be  appropriate  estimates 
of  their  population  counterparts.   Likewise,  in  econometrics,  estimation  of 
structural  models  using  only  the  "complete"  data  will  lead  to  biased  and  in- 
consistent estimates  of  the  true  parameters  if  the  realizations  of  the  missing 
observations  are  correlated  with  the  residuals  in  the  equation  being  estimated. 
To  overcome  this  problem,  much  work  has  been  done  to  replace  the  missing 
entries  with  estimates  derived  from  the  available  data.   The  simplest  example 
of  such  procedures  is  to  let  the  underlying  distribution  of  a  vector  of  random 
variables  be  multivariate  normal  and  to  assume  that  only  one  of  the  variables 
has  missing  observations.   Unbiased  estimates  of  the  missing  entries  may  be 
obtained  by  regressing  the  observed  entries  on  the  remaining  data  in  each 
observation.   The  estimated  regression  parameters  can  then  be  used  to  form 
an  unbiased  prediction  of  the  missing  entries  so  long  as  the  conditional 
expectation  of  the  missing  entries  is  the  same  as  for  the  observed  entries; 
that  is,  the  probability  of  an  observation  being  missing  cannot  depend  on  the 
value  that  would  have  been  observed.   Much  more  complicated  situations  where 
more  than  one  variable  may  be  missing  in  different  patterns  among  the  ob- 
servations may  occur,  and  these  situations  have  been  discussed  by  Afifi  and 
Elashoff  [1],  Hartley  and  Hocking  [7],  Orchard  and  Woodbury  [14], 
and recently by Dempster, Laird, and Rubin [4]. The techniques used are
generalizations  of  the  simple  example  in  which   the  missing  entries  are 
conditioned  on  the  available  information  and  predictions  are  formed  from 
the  mean  and  variance  of  the  multivariate  normal  distribution. 

These  procedures  all  depend  on  the  conditional  expectation  for  a 
variable  remaining  the  same  whether  or  not  it  is  observed.  Yet  situations 
may  occur  where  this  assumption  is  invalid.   Returning  to  the  simple  example, 
consider  the  case  of  a  censored  variable  which  is  not  recorded  if  its  value 
is  less  than  zero.   The  conditional  expectation  of  the  observed  entries, 
conditional  on  both  the  remaining  data  in  each  observation  and  the  fact  that 
they  are  observed,  exceeds  the  conditional  expectation  of  the  variable  when 
the  fact  of  censoring  is  not  taken  into  account.   Then 
predictions  of  the  missing  entries  using  estimated  regression  parameters 
from  the  observed  entries  will  lead  to  upward  biased  predictions.   Of  course, 
in  this  simple  example  the  problem  would  be  readily  apparent  so  that  ap- 
propriate procedures  could  be  used  to  form  consistent  predictions.   However, 
a  more  subtle  and  less  readily  apparent  problem  may  occur  which  can  lead  to 
similar  consequences.   Suppose  the  probability  of  observing  the  entry 
depends  on  its  realization.   That  is,  negative  values  of  the  variable  have 
a  higher  probability  of  being  missing  but  some  negative  values  will  still 
be  observed.   If  simple  regression  techniques  are  used  to  form  the  predictions, 
biased  and  inconsistent  predictions  will  again  result.   Thus  a  statistical 
procedure  is  required  which  will  correct  this  problem  if  it  is  present  and 
yield  predictions  with  desirable  properties. 

In  this  paper  we  apply  maximum  likelihood  procedures  which  can  be  used 
to  test  whether  the  problem  exists  and  also  to  provide  predictions  of  the 
missing entries. These procedures are closely related to work in the
econometric literature by Quandt [15], Goldfeld and Quandt [5], Hanoch
[6], Hausman and Wise [8], Heckman [9], Lee [11], Maddala and
Nelson [12], and Nelson [13]. Nelson has called this general problem
the  "stochastic  censoring"  problem.   While  our  model  differs  from  his 
specification,  the  problem  being  considered  is  similar.  Along  with  the 
statistical  specification  of  the  problem  we  will  discuss  a  method  for  finding 
the  maximum  likelihood  estimates  of  the  unknown  parameters.   Previous  authors 
have  reported  difficulties  in  maximizing  similar  likelihood  functions,  but 
using a modified scoring method first employed by Berndt, Hall, Hall, and
Hausman [3], excellent results are obtained. Alternative consistent but
asymptotically  inefficient  estimators  will  also  be  considered  in  the  appendix. 
Such estimators for related problems have been proposed by Amemiya [2],
Hanoch [6], Heckman [9], and Lee [11]. Lastly, an example from a
survey of 84 Canadian industries is used to demonstrate the usefulness of
the procedure and the existence of the problem in actual data.¹

¹ The problem came to the attention of the authors in the context of estimating
a simultaneous equations model of the determinants of structure and performance
in Canadian industry.


1.   Statistical  Specification 

The most common specification for the missing data problem is to assume
the observations $z_i$, $i = 1,\ldots,T$, are drawn from a multivariate normal
distribution $N(\mu,\Sigma)$ where $z_i$ and $\mu$ are assumed to be vectors of length $k$ and the
covariance matrix $\Sigma$ is an unknown $k \times k$ positive definite matrix. For the
time being, we consider the case where some of the first entries
are not observed; that is, $z_{1i}$ is missing for the incomplete observations.
Reordering the observations so that the first $s$ observations correspond to
the complete observations, the sample is divided into two parts: $i = 1,\ldots,s$
corresponds to all entries observed while $i = s+1,\ldots,T$ corresponds to $z_{1i}$
missing. Then using more traditional regression notation set $z_{1i}$ equal
to $y_i$ and the $k-1$ subvector $(z_{2i},\ldots,z_{ki})$ equal to $X_i$ so that

$$E(y_i \mid X_i) = X_i\beta = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(X_i - \mu_X), \qquad V(y_i \mid X_i) = \sigma^2 = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \tag{1.1}$$

where $\mu_1$ and $\Sigma_{11}$ correspond to the mean and variance of $z_1$ while $\Sigma_{ij}$ corresponds
to the partitioning of $z$ into $y$ and $X$. To predict the missing $T-s$
values of $y_i$ the following procedure is often used. For the sample $i = 1,\ldots,s$
run least squares of $y_i$ on $X_i$ to estimate $\beta$, say $\hat\beta_{OLS}$. Then form the
conditional predictions $\hat y_i = X_i\hat\beta_{OLS}$ for $i = s+1,\ldots,T$. Under the assumption
that $E(y_i \mid X_i, y_i \text{ not observed}) = E(y_i \mid X_i, y_i \text{ observed})$, the predictions
are known to be the best linear unbiased predictions.
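The regression-based fill-in procedure just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a constant and a single regressor stand in for $X_i$, the toy numbers are hypothetical, and the function names `ols_fit` and `impute_missing` are ours rather than anything defined in the paper.

```python
def ols_fit(x, y):
    """Least squares of y on a constant and one regressor x.

    Returns (intercept, slope). A single regressor is an assumption made
    here for brevity; the paper allows a k-1 vector X_i.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_missing(x_complete, y_complete, x_incomplete):
    """Fit on the s complete observations, predict y for the T - s others."""
    intercept, slope = ols_fit(x_complete, y_complete)
    return [intercept + slope * xi for xi in x_incomplete]


# Hypothetical toy data: the complete observations satisfy y = 1 + 2x exactly.
x_obs = [1.0, 2.0, 3.0, 4.0]
y_obs = [3.0, 5.0, 7.0, 9.0]
x_mis = [5.0, 6.0]
y_hat = impute_missing(x_obs, y_obs, x_mis)
```

These predictions are appropriate only when the conditional expectation of $y_i$ is the same for observed and unobserved entries, which is exactly the assumption questioned in the remainder of this section.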

Problems arise, however, when the equality of the conditional expectations
fails to hold. The simple censoring case corresponds to the Tobit
model in econometrics (Tobin [17]). In that case, $y_i$ is not observed if its realization
is less than a constant, say zero. Then the conditional expectation of $y_i$
given $X_i$ and the fact that it is observed is

$$E(y_i \mid X_i, y_i \ge 0) = X_i\beta + \sigma\,\frac{\phi(X_i\beta/\sigma)}{\Phi(X_i\beta/\sigma)} \tag{1.2}$$

where $\phi$ and $\Phi$ correspond to the unit normal density and distribution function,
respectively.² This result follows from writing $y_i = X_i\beta + \varepsilon_i$ where
$\varepsilon_i \sim N(0,\sigma^2)$ and noting that $y_i$ is observed if $\varepsilon_i \ge -X_i\beta$. The expected value
of $\varepsilon_i$ then follows from

$$E(\varepsilon_i \mid X_i, \varepsilon_i \ge -X_i\beta) = \frac{1}{1-\Phi(-X_i\beta/\sigma)} \int_{-X_i\beta}^{\infty} \varepsilon_i\,\frac{e^{-\varepsilon_i^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}\,d\varepsilon_i = \frac{\sigma}{\Phi(X_i\beta/\sigma)} \int_{-X_i\beta/\sigma}^{\infty} r\,\phi(r)\,dr = \sigma\,\frac{\phi(X_i\beta/\sigma)}{\Phi(X_i\beta/\sigma)} \tag{1.3}$$

On the other hand, for those $y_i$'s which are not observed

$$E(y_i \mid X_i, y_i < 0) = X_i\beta - \sigma\,\frac{\phi(X_i\beta/\sigma)}{1-\Phi(X_i\beta/\sigma)} \tag{1.4}$$

If regression coefficients formed from the good data corresponding to equation
(1.2) are used to predict the missing data corresponding to equation (1.4),
an upward bias results. In this situation $\beta$ and $\sigma^2$ may be
estimated by maximizing the log likelihood function (Tobin [17] and Amemiya [2]).³

$$\ell = k - \frac{s}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{s}(y_i - X_i\beta)^2 + \sum_{i=s+1}^{T}\log\left[1-\Phi(X_i\beta/\sigma)\right] \tag{1.5}$$

Then consistent conditional expectations would be formed using equation (1.4).
Alternative estimators which require less computation than maximum likelihood
are available. They will be discussed in the appendix. However,
numerous algorithms and computer programs exist which yield maximum likelihood
estimates at relatively low cost even for very large samples.


² The term $\phi(X_i\beta/\sigma)/\Phi(X_i\beta/\sigma)$ corresponds to the inverse Mills' ratio. The
denominator gives the probability that $y_i$ is observed conditional on $X_i$.

³ The constant term $k$ in the log likelihood function equals $-\frac{s}{2}\log(2\pi)$.
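As a check on the censoring algebra, the sketch below compares the closed form on the right of equation (1.3) with a direct numerical integration of the truncated normal mean, and evaluates the log likelihood of equation (1.5). It uses only the Python standard library. The function names, the midpoint integration rule, and the restriction of $X_i\beta$ to a constant plus one regressor are assumptions made for illustration.

```python
import math


def phi(z):
    """Unit normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Unit normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_error_mean(xb, sigma):
    """Closed form of equation (1.3): E(eps | eps >= -xb)."""
    return sigma * phi(xb / sigma) / Phi(xb / sigma)


def truncated_error_mean_numeric(xb, sigma, steps=20000):
    """Midpoint-rule evaluation of the first integral in equation (1.3)."""
    lower = -xb
    upper = max(lower, 0.0) + 10.0 * sigma  # tail beyond this is negligible
    h = (upper - lower) / steps
    total = 0.0
    for j in range(steps):
        e = lower + (j + 0.5) * h
        density = math.exp(-e * e / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)
        total += e * density
    return total * h / (1.0 - Phi(-xb / sigma))


def tobit_loglik(beta, sigma, observed, missing_x):
    """Equation (1.5) with X_i*beta = beta[0] + beta[1]*x (illustrative assumption).

    observed  : list of (x, y) pairs for i = 1..s
    missing_x : list of x values for i = s+1..T
    """
    s = len(observed)
    ll = -0.5 * s * math.log(2.0 * math.pi) - 0.5 * s * math.log(sigma ** 2)
    for x, y in observed:
        ll -= (y - beta[0] - beta[1] * x) ** 2 / (2.0 * sigma ** 2)
    for x in missing_x:
        ll += math.log(1.0 - Phi((beta[0] + beta[1] * x) / sigma))
    return ll


closed_form = truncated_error_mean(0.5, 1.0)
numeric = truncated_error_mean_numeric(0.5, 1.0)
```

The two values should agree up to integration error, which is one way to guard against sign mistakes when the same formula is reused for equations (1.2) and (1.4).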


The particular case of censoring is apt to be evident in a sampling
situation. A more common but less evident sampling situation arises when
the probability of observing $y_i$ varies with its realization and perhaps the
realizations of other random variables $W_i$. Then defining the indicator $d_i = 1$
for $y_i$ being observed and $d_i = 0$ if $y_i$ is not observed we might specify a
model where the probability of observing $y_i$ depends on its realization and
other variables. First, write $y_i$ in a regression equation as $y_i = X_i\beta + \varepsilon_{1i}$.
A general linear specification could make the probability of observing
$y_i$ take the following form: $y_i$ is observed if $\alpha y_i + X_i\theta + W_i\gamma + \eta_i \ge 0$
where $W_i$ are variables which do not enter the conditional expectation but
affect the probability of $y_i$ being observed and the $\eta_i$ are iid random variables.
Substituting for $y_i$ leads to $X_i(\alpha\beta + \theta) + W_i\gamma + \alpha\varepsilon_{1i} + \eta_i$. But since $\alpha\beta$ and $\theta$
enter the specification in an equivalent way it is convenient to combine them
into a vector and write the specification as $X_i(\alpha\beta+\theta) + W_i\gamma + \varepsilon_{2i}$, which corresponds
to the "reduced form" of the model where $\varepsilon_{2i} = \alpha\varepsilon_{1i} + \eta_i$. Defining the vectors
$R_i = [X_i \;\; W_i]$ and $\delta = [(\alpha\beta+\theta)' \;\; \gamma']'$
then gives a probability of $y_i$ being observed

$$\mathrm{pr}(d_i = 1) = \mathrm{pr}(R_i\delta + \varepsilon_{2i} \ge 0) \tag{1.6}$$

Then assuming that $\eta_i$ is normally distributed and taking the variance of
$\varepsilon_{2i}$ equal to 1, $\sigma_2 = 1$, as a normalization, we have the probit model

$$\mathrm{pr}(d_i = 1) = \Phi(R_i\delta) \quad\text{and}\quad \mathrm{pr}(d_i = 0) = 1 - \Phi(R_i\delta) \tag{1.7}$$

However, besides knowing whether $d_i = 0$ or $d_i = 1$, we also know the value of
$y_i$ if it is observed, $d_i = 1$. This observation yields information
about $\varepsilon_{1i}$ which in turn gives information about $\varepsilon_{2i}$ if they are
correlated. Thus in specifying the likelihood for those observations with
$d_i = 1$, equation (1.7) will be altered to use this additional knowledge.


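One of two sample outcomes occurs for each observation. The first
outcome is $d_i = 1$ so that $y_i$ is observed along with $X_i$ and $R_i$. Writing the
joint (generalized) density and then conditioning on the observed value of $y_i$
gives the expression

$$f(y_i, d_i = 1 \mid X_i, R_i) = g(d_i = 1 \mid y_i, X_i, R_i)\,h(y_i \mid X_i) = \mathrm{pr}(R_i\delta + \varepsilon_{2i} \ge 0 \mid y_i, X_i, R_i)\cdot\frac{1}{\sigma_1}\,\phi\!\left(\frac{y_i - X_i\beta}{\sigma_1}\right) \tag{1.8}$$

Then since $\varepsilon_{1i}$ and $\varepsilon_{2i}$ are bivariate normal with correlation $\rho$, the regression equation for $\varepsilon_{2i}$
given $\varepsilon_{1i}$ is $\varepsilon_{2i} = \frac{\rho}{\sigma_1}\varepsilon_{1i} + v_i$ where $v_i$ is $N(0, 1-\rho^2)$ and is uncorrelated with
$\varepsilon_{1i}$, using $\sigma_2 = 1$ from the earlier normalization. Using this relationship
equation (1.8) may be rewritten as

$$f(y_i, d_i = 1 \mid X_i, R_i) = \mathrm{pr}\!\left(R_i\delta + \frac{\rho}{\sigma_1}(y_i - X_i\beta) + v_i \ge 0\right)\cdot\frac{1}{\sigma_1}\,\phi\!\left(\frac{y_i - X_i\beta}{\sigma_1}\right) = \Phi\!\left(\frac{R_i\delta + \frac{\rho}{\sigma_1}y_i - \frac{\rho}{\sigma_1}X_i\beta}{(1-\rho^2)^{1/2}}\right)\cdot\frac{1}{\sigma_1}\,\phi\!\left(\frac{y_i - X_i\beta}{\sigma_1}\right) \tag{1.9}$$

Having derived the likelihood when $y_i$ is observed, the case where $y_i$ is not
observed so that $d_i = 0$ follows easily. Basically $y_i$ is "integrated out" of
equation (1.8) to find the likelihood as expected from equation (1.6)

$$f(d_i = 0 \mid X_i, R_i) = \mathrm{pr}(R_i\delta + \varepsilon_{2i} < 0 \mid X_i, R_i) = 1 - \Phi(R_i\delta) \tag{1.10}$$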


Using the ordering of the data where the first $s$ observations correspond
to $y_i$ being observed and the last $T-s$ values of $y_i$ being missing, the
log likelihood function follows from equations (1.9) and (1.10). Setting
$c$, a constant, equal to $-\frac{s}{2}\log 2\pi$ yields the log likelihood function

$$\ell = c - \frac{s}{2}\log\sigma_1^2 + \sum_{i=1}^{s}\left[\log\Phi\!\left(\frac{R_i\delta + \frac{\rho}{\sigma_1}(y_i - X_i\beta)}{(1-\rho^2)^{1/2}}\right) - \frac{1}{2\sigma_1^2}(y_i - X_i\beta)^2\right] + \sum_{i=s+1}^{T}\log\left(1-\Phi(R_i\delta)\right) \tag{1.11}$$

Maximization of the log likelihood function in equation (1.11) leads to estimates
of $\theta = (\beta,\delta,\rho,\sigma_1)$ which have the usual desirable properties of maximum
likelihood estimates. Although this specification does not satisfy the
classical regularity conditions since the likelihood is a mixture of densities
and probabilities, Amemiya's [2] proofs for the Tobit case carry over in
a straightforward manner for this problem.
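For readers who wish to experiment, equation (1.11) can be transcribed almost term by term. The sketch below is a plain Python version under illustrative assumptions: observations are passed as lists, the name `selection_loglik` and the toy data are hypothetical, and no safeguard is included for the case where $\Phi(\cdot)$ underflows to zero.

```python
import math


def phi(z):
    """Unit normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Unit normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def selection_loglik(beta, delta, rho, sigma1, observed, missing):
    """Log likelihood of equation (1.11).

    observed : list of (X_i, R_i, y_i) for the s observations with d_i = 1
    missing  : list of R_i for the T - s observations with d_i = 0
    Requires |rho| < 1 and sigma1 > 0.
    """
    s = len(observed)
    ll = -0.5 * s * math.log(2.0 * math.pi) - 0.5 * s * math.log(sigma1 ** 2)
    for X, R, y in observed:
        resid = y - dot(X, beta)
        index = (dot(R, delta) + (rho / sigma1) * resid) / math.sqrt(1.0 - rho ** 2)
        ll += math.log(Phi(index)) - resid ** 2 / (2.0 * sigma1 ** 2)
    for R in missing:
        ll += math.log(1.0 - Phi(dot(R, delta)))
    return ll


# Hypothetical toy inputs: X_i = (1, x_i) and R_i = (1, x_i, w_i).
beta = [0.5, 1.0]
delta = [0.2, 0.3, -0.1]
sigma1 = 0.8
observed = [([1.0, 0.5], [1.0, 0.5, 1.0], 1.2),
            ([1.0, -0.3], [1.0, -0.3, 0.0], 0.4)]
missing = [[1.0, 0.1, 1.0]]
ll_rho0 = selection_loglik(beta, delta, 0.0, sigma1, observed, missing)
ll_rho5 = selection_loglik(beta, delta, 0.5, sigma1, observed, missing)
```

With $\rho = 0$ the value returned equals the sum of a probit log likelihood and a normal regression log likelihood, which gives a convenient correctness check for any implementation.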

The crucial role of the covariance between $\varepsilon_{1i}$ and $\varepsilon_{2i}$ is evident in
the correlation coefficient $\rho$ in equation (1.9). If $\rho = 0$ so that no correlation
exists, then the probability of observing $y_i$ is independent of $\varepsilon_{1i}$ and
the least squares procedure of predicting the missing $y_i$'s from the regression
of the observed $y_i$'s is a correct procedure. On the other hand, if $\rho \ne 0$,⁴ then
the probability of observing $y_i$ depends on $\varepsilon_{1i}$ so that the regression procedure
will lead to the incorrect conditional expectation. The conditional
expectation is calculated from again writing $y_i = X_i\beta + \varepsilon_{1i}$ and noting that
the regression equation for $\varepsilon_{1i}$ given $\varepsilon_{2i}$ may be written as $\varepsilon_{1i} = \rho\sigma_1\varepsilon_{2i} + \omega_i$
where $\omega_i \sim N(0, \sigma_1^2(1-\rho^2))$ and is independent of $\varepsilon_{2i}$ (the normalization
$\sigma_2^2 = 1$ has again been used). Then the conditional expectation of $\varepsilon_{1i}$ given
that $d_i = 1$ or equivalently that $\varepsilon_{2i} \ge -R_i\delta$ is

$$E(\varepsilon_{1i} \mid \varepsilon_{2i} \ge -R_i\delta) = \rho\sigma_1 E(\varepsilon_{2i} \mid \varepsilon_{2i} \ge -R_i\delta) + E(\omega_i \mid \varepsilon_{2i} \ge -R_i\delta) = \rho\sigma_1\,\frac{\phi(R_i\delta)}{\Phi(R_i\delta)} \tag{1.12}$$

using equation (1.3) and the normalization of $\sigma_2 = 1$. Then the conditional
expectation of $y_i$ follows

$$E(y_i \mid X_i, d_i = 1) = X_i\beta + \rho\sigma_1\,\frac{\phi(R_i\delta)}{\Phi(R_i\delta)}. \tag{1.13}$$

⁴ Since $\varepsilon_{2i} = \alpha\varepsilon_{1i} + \eta_i$, $\rho \ne 0$ when $\alpha \ne 0$, which is equivalent to having the
probability of observing $y_i$ depend on its realization.


Thus, as in the earlier case, least squares on the observed data leads
to biased and inconsistent estimates of $\beta$. Furthermore, predictions for
the missing data will be inconsistent since

$$E(y_i \mid X_i, d_i = 0) = X_i\beta - \rho\sigma_1\,\frac{\phi(R_i\delta)}{1-\Phi(R_i\delta)}. \tag{1.14}$$

To form consistent predictions, consistent estimates of the parameters $\beta$, $\delta$, $\rho$,
and $\sigma_1$ are required, and equation (1.14) is used to form the expectations.
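The two conditional expectations, equations (1.13) and (1.14), are simple to evaluate once parameter values are in hand. The following sketch assumes the indices $X_i\beta$ and $R_i\delta$ have already been computed; the function names and the numerical inputs are hypothetical and chosen only to show the direction of the correction.

```python
import math


def phi(z):
    """Unit normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Unit normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_y_observed(xb, rd, rho, sigma1):
    """Equation (1.13): E(y_i | X_i, d_i = 1), with xb = X_i*beta, rd = R_i*delta."""
    return xb + rho * sigma1 * phi(rd) / Phi(rd)


def expected_y_missing(xb, rd, rho, sigma1):
    """Equation (1.14): E(y_i | X_i, d_i = 0)."""
    return xb - rho * sigma1 * phi(rd) / (1.0 - Phi(rd))


# Hypothetical values, not estimates from the paper.
xb, rd, rho, sigma1 = 0.4, 0.3, 0.9, 0.25
obs_mean = expected_y_observed(xb, rd, rho, sigma1)
mis_mean = expected_y_missing(xb, rd, rho, sigma1)
```

For positive $\rho$ the prediction for a missing entry lies below $X_i\beta$ and the mean of an observed entry lies above it, and weighting the two by $\Phi(R_i\delta)$ and $1-\Phi(R_i\delta)$ recovers $X_i\beta$.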

Unfortunately,  the  log  likelihood  function  does  not  appear  to  be 
globally  concave  so  that  some  care  must  be  taken  in  choosing  an  algorithm 
to  find  the  maximum.   Furthermore,  some  authors  have  reported  difficulty  in 
broadly  similar  situations.   Therefore,  the  next  section  discusses  an 
algorithm  which  has  performed  quite  well  in  practice. 




2.   Estimation 

To maximize the likelihood function from equation (1.11), an algorithm
proposed by Berndt, Hall, Hall, and Hausman [3] is used. This method is
similar to the method of scoring, e.g. Rao [16, pp. 366ff], but instead
of requiring large sample expectations to be taken to evaluate Fisher's
information matrix, it relies on the result that in large samples the
covariance of the gradient is a consistent estimate of the information matrix
in the neighborhood of the optimum. Letting $\theta$ be the parameter vector and the
gradient $g(\theta) = \partial\ell/\partial\theta$, its asymptotic covariance in the neighborhood of the
optimum is approximated by $Q(\theta) = \sum_{i=1}^{T} \frac{\partial \log f_i}{\partial\theta}\,\frac{\partial \log f_i}{\partial\theta}'$ where $f_i$ is the likelihood of
the $i$th observation. Thus, the method requires only computation of first
derivatives. The updating formula at the $j$th iteration for $\hat\theta^{j+1}$ is then

$$\hat\theta^{j+1} = \hat\theta^{j} + \lambda^{j}\,Q(\hat\theta^{j})^{-1}g(\hat\theta^{j}) \tag{2.1}$$

where $\lambda^{j} > 0$ is the stepsize chosen in the direction $Q(\hat\theta^{j})^{-1}g(\hat\theta^{j})$. The stepsize
$\lambda^{j}$ is chosen according to the criterion in Berndt, et al., p. 656, and
convergence to a local maximum is assured. The algorithm has the desirable
"uphill" property: an improvement in the likelihood function occurs at
each step. Experience with this algorithm has been very satisfactory so long
as the derivatives are calculated correctly.⁵ The algorithm can be considered
a generalization of the Gauss-Newton algorithm. Estimates of the asymptotic
covariance matrix of the estimates follow from $Q(\hat\theta_{ML})^{-1}$, where $\hat\theta_{ML}$ is the
value of $\theta$ which maximizes the likelihood function. Care must be taken to
ensure that the global maximum has been found although in practice no
difficulties arose.


⁵ This statement is not meant to imply that numerical derivatives are inferior
to analytical calculation of derivatives by the computer. Rather, all convergence
problems to date have been caused by incorrect algebraic computations
of the derivatives.
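The update in equation (2.1) can be illustrated on a much simpler likelihood than equation (1.11). The sketch below applies the same outer-product-of-gradients step to a normal sample with unknown mean and log standard deviation. Several things here are assumptions for illustration only: the two-parameter toy model, the data, the starting values, and in particular the stepsize rule, which simply tries $\lambda = 1, 1/2, 1/4, \ldots$ and keeps the best trial rather than implementing the criterion of Berndt, et al.

```python
import math


def loglik(theta, data):
    """Normal log likelihood with theta = (mu, log sigma)."""
    mu, log_sigma = theta
    sigma2 = math.exp(2.0 * log_sigma)
    return sum(-0.5 * math.log(2.0 * math.pi) - log_sigma
               - (y - mu) ** 2 / (2.0 * sigma2) for y in data)


def scores(theta, data):
    """Per-observation first derivatives of log f_i with respect to theta."""
    mu, log_sigma = theta
    sigma2 = math.exp(2.0 * log_sigma)
    return [((y - mu) / sigma2, (y - mu) ** 2 / sigma2 - 1.0) for y in data]


def bhhh(theta, data, tol=1e-12, max_iter=500):
    """Iterate theta <- theta + lam * Q^{-1} g using first derivatives only."""
    for _iteration in range(max_iter):
        sc = scores(theta, data)
        g = [sum(s[0] for s in sc), sum(s[1] for s in sc)]
        # Q is the sum of outer products of the individual score vectors.
        q00 = sum(s[0] * s[0] for s in sc)
        q01 = sum(s[0] * s[1] for s in sc)
        q11 = sum(s[1] * s[1] for s in sc)
        det = q00 * q11 - q01 * q01
        d = [(q11 * g[0] - q01 * g[1]) / det,
             (q00 * g[1] - q01 * g[0]) / det]
        if g[0] * d[0] + g[1] * d[1] < tol:  # g' Q^{-1} g as convergence measure
            break
        current = loglik(theta, data)
        # Simplified stepsize search (a stand-in for the published criterion).
        trials = [(theta[0] + lam * d[0], theta[1] + lam * d[1])
                  for lam in (2.0 ** -k for k in range(30))]
        best = max(trials, key=lambda t: loglik(t, data))
        if loglik(best, data) <= current:
            break  # no uphill step found
        theta = best
    return theta


# Hypothetical sample and starting values.
data = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9, 1.2, 3.6]
mu_hat, log_sigma_hat = bhhh((0.0, 0.0), data)
sigma_hat = math.exp(log_sigma_hat)
```

Because every accepted step raises the log likelihood, the "uphill" property holds by construction; for this toy model the iterations should settle at the sample mean and the maximum likelihood standard deviation.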




Although initial consistent estimates may be derived from methods
discussed in the appendix and used as starting values for the maximum likelihood
algorithm, an easier procedure performed well in practice. Least squares
estimates on the observed data are used to set initial values for $\hat\beta^{1}$ and
$\hat\sigma_1^{1}$, while $\hat\delta^{1} = 0$ and $\hat\rho^{1} = 0$. The maximum likelihood algorithm always converged
to the global optimum beginning from these estimates. Maximum likelihood
thus seems to provide a relatively inexpensive estimator with little trouble
encountered in computing the estimates.⁶


⁶ On the sample of 84 observations discussed in the next section, computer
costs on the 370-168 at MIT averaged about $3. For a larger problem with
nearly 2000 observations, the cost increased to about $8.50.


3.   Empirical  Example

We now apply the statistical model specified in Section 1 to a body of
data collected on the economic characteristics of Canadian and US industries.
The sample consists of data on 84 industries with the number of missing
observations varying from zero to 35. While many of the characteristics
are continuous and appear approximately normal after transformation, other
characteristics are represented by dummy variables. These dummy variables might
well be expected to enter the vector $R_i$ in equation (1.6) even if they did
not appear in the conditional mean. Two variables with missing observations
were chosen to be studied. The first variable, FSE, represents the amount of
foreign control in the Canadian industry. It is constructed by dividing the value
of shipments by establishments classified as 50% or more foreign-controlled
by the value of shipments by all establishments in the industry. It was
felt that the missing observations here might well correspond to low values
of FSE since data are not collected on foreign ownership unless it is
"significant." Twenty-four out of 84 industries are missing observations
on FSE. The second variable studied, CDRC, represents the cost disadvantage
of small firms. It is constructed by dividing the value added per worker
in the smallest one-half of the firms in an industry by the value added per worker
in the other one-half of the industry. For CDRC, 13 out of 84 industries
have missing observations. However, no definite relation between a missing
observation and the value of the variable was thought to exist.

To test for "biased" missing data, a test of the hypothesis $\rho = 0$ in
equation (1.9) is undertaken. Inspection of that equation or the likelihood
function in equation (1.11) demonstrates that if $\rho = 0$, the likelihood
function factors into two parts: the probit part representing the probability
of observing the variable and a least squares part representing the density
of the variable. Thus, the hypothesis can be tested by doing a likelihood
ratio test comparing the maximum of equation (1.11) with the sum of the
probit and least squares log likelihoods when $\rho = 0$. An asymptotically equivalent
test used here which is computationally easier is to do a $\chi^2$ test
using the computed asymptotic variance of $\hat\rho$. The test statistic is computed by
dividing $\hat\rho^2$ by the estimate of its asymptotic variance. Besides testing this
hypothesis, we also compare the estimated $\hat\beta$ to the least squares $\hat\beta_{OLS}$. Lastly, we
form the conditional predictions for the missing data using equation (1.14) and compare
them to the predictions using the least squares estimates, $\hat y_i = X_i\hat\beta_{OLS}$. This
last comparison captures the most important aspect of the problem, whether
multivariate normal methods (least squares) provide consistent predictions
of the missing observations.
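Both test statistics are elementary to compute. The sketch below is a minimal illustration using the Python standard library: the function names are ours, the tail probability relies on the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$, and the inputs are the estimates of $\rho$ and their asymptotic standard errors as reported in Tables I and II.

```python
import math


def wald_chi2(estimate, std_error):
    """Squared estimate divided by its estimated asymptotic variance."""
    return (estimate / std_error) ** 2


def lr_statistic(loglik_unrestricted, loglik_restricted):
    """Minus twice the difference of the log likelihoods."""
    return -2.0 * (loglik_restricted - loglik_unrestricted)


def chi2_1_pvalue(stat):
    """Upper tail of a chi-square with one degree of freedom.

    Uses P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))


# rho estimates and standard errors as reported in Tables I and II.
fse_wald = wald_chi2(0.937, 0.094)
cdrc_wald = wald_chi2(-0.492, 0.668)
fse_pvalue = chi2_1_pvalue(fse_wald)
cdrc_pvalue = chi2_1_pvalue(cdrc_wald)
```

The same `chi2_1_pvalue` function can be applied to a likelihood ratio statistic, since both statistics are asymptotically $\chi^2$ with one degree of freedom under $\rho = 0$.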

In specifying the model for the extent of foreign ownership, FSE,
the following variables are used to form the conditional mean: a constant,
value added per establishment in the industry (VPE), the standard deviation
of the value of shipments of firms in the industry (SSI), proportion of shipments
in the US counterpart industry by the largest four enterprises in that
industry (US467), the proportion of nonproduction workers in the industry (NPW),
and average overhead labor costs in the industry (WNP). All these
variables were included in the probit function of equation (1.6) along with
a dummy variable whose value equals one for consumer goods industries if they
sell a convenience good and is zero otherwise (CON). In Table I the
maximum likelihood estimates of the parameters are presented along with
their asymptotic standard errors. We also present the least squares
estimates of the same model. Note that while the maximum likelihood
estimates (MLE) have the same sign and order of magnitude as the least
squares estimates (OLS), notable differences in the coefficient estimates
are present with the largest difference being about 40%. Examination of
the estimate of $\rho$ shows that the value of FSE significantly affects the probability
of it being observed. The estimate of .937 exceeds its estimated asymptotic
standard error by a factor of almost 10 and the asymptotic $\chi^2_1$ test that
$\rho = 0$ has the value of 99.4 which leads to a decisive rejection of the null
hypothesis. For comparison purposes a likelihood ratio test was done on the
ML value versus the sum of the log likelihoods of a probit estimation and the least
squares regression. A test of $\rho = 0$ follows from minus twice the difference of
the log likelihoods, which equals 7.90 and is distributed as $\chi^2_1$. Again we reject
the null hypothesis although not as decisively as before. Also $\hat\rho$ has the expected
sign since examination of equation (1.9) shows that large values of the
dependent variable, FSE, increase the probability of observing it, $d_i = 1$, for $\rho$
taking positive values. In fact, this effect is so strong that the situation is
almost the pure censoring case of equation (1.3) which has $\rho = 1$ when the
censoring point is known. This result is in accord with the prior judgment
of the economists who gathered the data that low values of FSE are less likely
to be recorded in the data. Lastly, the mean of the 24 missing observations
is estimated using equation (1.14) and the maximum likelihood estimates.
Its value is .2195 which is considerably lower than the least squares prediction
of .5724. Thus, the maximum likelihood estimates seem valuable in finding
significant non-randomness in the missing data.

TABLE I
Estimates for Extent of Foreign Ownership (FSE)
(standard errors in parentheses)

Variable   Coefficient   MLE              OLS              MLE/OLS
Constant   $\beta_0$     -.972 (.308)     -.696 (.242)     1.40
VPE        $\beta_1$     -.078 (.038)     -.092 (.044)      .85
SSI        $\beta_2$     -.970 (.943)     -1.13 (.576)      .85
US467      $\beta_3$      .077 (.022)      .097 (.016)      .79
NPW        $\beta_4$      .704 (.193)      .499 (.149)     1.41
WNP        $\beta_5$      .144 (.046)      .118 (.036)     1.22
Constant   $\delta_0$    1.11  (1.57)
VPE        $\delta_1$     .099 (.051)
SSI        $\delta_2$    -.257 (.367)
US467      $\delta_3$    -.027 (.010)
NPW        $\delta_4$     .153 (.181)
WNP        $\delta_5$     .033 (.245)
CON        $\delta_6$    -.633 (.325)
           $\rho$         .937 (.094)
$\ell^*$ (log likelihood value)   -160.9

For comparison purposes, the technique is applied to the cost disadvantage
ratio (CDRC) which has 13 out of 84 industries missing observations.
Here, it was felt that the value of the variable itself might not significantly
affect the probability of it being observed. The following variables are used
to form the conditional mean: a constant, the standard deviation of the
total value of shipments (SSI), average assets per
employee (LAB2C), and the proportion of non-production workers in the industry
(NPW). An additional variable, the proportion of shipments by the
US counterpart industry by the largest four enterprises (US467), is used in
the probit function. The results presented in Table II show that the ML
estimates of the parameters of the conditional mean are extremely close
to the least squares estimates. While $\hat\rho$ indicates some probability of
observations of CDRC being missing for high values of CDRC, it is estimated
quite imprecisely. It is less in absolute value than its asymptotic standard
error and the $\chi^2_1$ test for $\rho = 0$ has a value of only .5. A fact which is
perhaps more important is that the estimated mean for the 13 missing observations
using the maximum likelihood estimates is 1.10 compared to the mean
of the least squares estimates of .941. These predictions are quite close
for a variable which has a standard deviation of 1.71 in the observed data.
Thus, little evidence has been discovered that the missing observations do
not form a random sample conditional on the $X_i$'s. Furthermore, the predicted
values of the missing observations confirm the prior judgment of the
economists who constructed the data.

TABLE II
Estimates of Cost Disadvantage Ratio (CDRC)
(standard errors in parentheses)

Variable   Coefficient   MLE              OLS              MLE/OLS
Constant   $\beta_0$     1.25  (.087)     1.24  (.087)     1.01
SSI        $\beta_1$     -1.24 (.629)     -1.28 (.552)      .97
LAB2C      $\beta_2$     -.103 (.066)     -.113 (.063)      .91
NPW        $\beta_3$     -.408 (.143)     -.407 (.150)     1.00
Constant   $\delta_0$    2.89  (1.68)
SSI        $\delta_1$    -2.10 (4.47)
LAB2C      $\delta_2$    -.708 (.703)
NPW        $\delta_3$    -.485 (1.89)
US467      $\delta_4$    -.262 (.143)
           $\rho$        -.492 (.668)
$\ell^*$ (log likelihood value)   -14.1

In this section we have applied the missing data specification to
two different variables. The results are in accord with prior judgment.
Also, the maximum likelihood technique performed well in finding a potential
bias in the missing observations and had a large effect in predicting the
values of the missing observations.

4.   Conclusion

In this paper a method has been proposed for predicting the values of
missing data when the probability of observing the variable depends on its
value. Basically, a univariate approach has been taken by concentrating on
one variable at a time. Extension to the multivariate case where more
than one variable may be missing seems straightforward, but may be complicated
to do computationally. An equation of the form of equation (1.6) could
be specified for each variable and then missing values of $R_i$ replaced by their
expected values. But since the variance of $\varepsilon_{2i}$ then depends on the prediction
variance of the missing values of $R_i$ which would differ across observations,
a fairly complex likelihood function would result with the probit function
varying with each pattern of missing observations. However, the increased
complexity of this situation might be worthwhile given the increased efficiency
of using multivariate procedures.
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Appendix 

Alternative consistent, but inefficient, methods of estimation which
may require less computation are also available. Hanoch [6], Heckman [9],
and Lee [11] have used the conditional expectation of the observed data in
equation (1.13) to derive consistent estimates of the structural parameters.
An estimate of the inverse Mills ratio, $M_i = \phi(R_i\delta)/\Phi(R_i\delta)$, is made by using a
maximum likelihood probit program to estimate $\delta$. Then a regression can be
run on the observed data to estimate $\beta$ and $\sigma_{12}$:

(2.2)       $y_i = X_i\beta + \sigma_{12} M_i + u_i, \qquad i = 1, \ldots, s$

Lastly, the consistent estimates of $\beta$ and $\sigma_{12}$ can be used to predict the
missing $y_i$ for $i = s+1, \ldots, T$ using equation (1.14). However, to do
inference on the parameters correctly, say to test $\sigma_{12} = 0$, an additional measure must be taken
since $u_i$ is heteroscedastic with $\mathrm{var}(u_i) = \sigma_{11}(1 + \phi(R_i\delta)M_i - M_i^2)$. A weighted
least  squares  procedure  using  estimates  from  the  probit  estimates  can  be  used 
here.   For  the  data  used  in  this  paper  this  procedure  was  not  as  satisfactory 
as  maximum  likelihood.   The  expense  of  doing  both  a  probit  estimation  and 
then  a  least  squares  regression  was  only  slightly  less  than  the  expense  of 
doing  maximum  likelihood  estimation.  Also,  for  our  sample  of  84  observations 
the  estimates  differed  markedly  from  the  maximum  likelihood  estimates  in  some 
cases,  although  this  criticism  must  be  tempered  with  the  knowledge  that  the 
true  parameters  are  unknown  so  that  no  assurance  exists  that  maximum  likelihood 
is  providing  better  estimates.   However,  by  using  only  the  conditional 
expectation  this  regression  method  seems  likely  to  yield  imprecise  estimates 
unless  the  sample  is  large.   The  problem  arises  because  the  inverse  Mills 
ratio  has  high   correlation  with  X  except  for  extreme  probabilities. 
For example, in the first case of the extent of foreign control this procedure
estimated $\rho$ to be only .03 with a standard error of .84, and the estimates of
$\beta$ are quite close to the least squares estimates. Therefore, its estimate of
the mean of the missing observations is .5681, only slightly below the least
squares prediction of .5724 and far above the maximum likelihood estimate of
.2195. In the second example of the cost disadvantage ratio, its mean of
the missing observations is 1.03, which is about halfway between the least
squares estimate of .94 and the maximum likelihood estimate of 1.10. Thus,
this consistent but inefficient procedure did not perform well in our example
of 84 observations.
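
The remark that the inverse Mills ratio is highly correlated with X except at extreme probabilities can be illustrated numerically. The sketch below is only an illustration under simplifying assumptions: it treats the probit index $z = R_i\delta$ as a single regressor on an evenly spaced grid and compares the correlation between $z$ and $\phi(z)/\Phi(z)$ over a moderate range of probabilities with the correlation over a range that includes extreme probabilities. The function names and the grid ranges are hypothetical choices and do not come from the paper.

```python
import math

def inverse_mills(z):
    """Inverse Mills ratio phi(z) / Phi(z) for a standard normal."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return pdf / cdf

def correlation(xs, ys):
    """Sample (Pearson) correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mills_correlation(lo, hi, n=201):
    """Correlation between the index z and M(z) on an even grid over [lo, hi]."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return correlation(grid, [inverse_mills(z) for z in grid])

# Index values in [-1, 1]: observation probabilities of roughly .16 to .84.
r_moderate = mills_correlation(-1.0, 1.0)
# Index values in [-3, 3]: probabilities from near 0 to near 1.
r_extreme = mills_correlation(-3.0, 3.0)

print(f"correlation over moderate probabilities: {r_moderate:.3f}")
print(f"correlation including extreme probabilities: {r_extreme:.3f}")
```

Over the moderate range the ratio is almost an exact linear function of the index, so a regression that includes both X and $M_i$ has little independent variation with which to estimate $\sigma_{12}$; the near-linearity weakens only when the grid reaches extreme probabilities.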

This combined probit-least squares procedure, however, may have
important uses in large data sets where its consistency property ensures
accurate estimates. Also, beginning with these consistent estimates, one
step of the maximum likelihood algorithm, setting $\lambda = 1$ in equation (2.1),
yields estimates with the same asymptotic distribution as the maximum
likelihood estimates, as pointed out in Berndt et al. [3, p. 659]. However,
given the low cost of computing the global maximum of the likelihood function,
this one-step efficient algorithm was not used. One last estimator should
be mentioned. Amemiya [2] has proposed a consistent, but inefficient,
instrumental variables estimator for the Tobit specification which may be
readily adapted to the current problem. It requires considerably less
computation than the combined probit-regression estimator. However, it
was not used here since it did not provide a convenient test of the hypothesis
of randomness, $\rho = 0$, which we wanted to test in the current problem.
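
The combined probit-least squares procedure discussed in this appendix can be sketched in a few lines of Python. The sketch is not the program used for the paper: it runs on simulated data with hypothetical parameter values, assumes that $y_i$ is observed when $R_i\delta + \epsilon_{2i} > 0$ with unit variances for both disturbances (so that $\sigma_{12} = \rho$), uses a plain Newton-Raphson probit, and stops before the prediction step because equation (1.14) is not reproduced here. All function and variable names are illustrative.

```python
import math
import random

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf, clamped away from 0 and 1 for numerical safety."""
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return min(max(p, 1e-12), 1.0 - 1e-12)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))) / aug[r][r]
    return x

def probit_mle(R, d, iters=25):
    """Newton-Raphson for the probit model P(d_i = 1) = Phi(R_i delta)."""
    k = len(R[0])
    delta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        neg_hess = [[0.0] * k for _ in range(k)]
        for Ri, di in zip(R, d):
            z = sum(r * c for r, c in zip(Ri, delta))
            lam = phi(z) / Phi(z) if di else -phi(z) / (1.0 - Phi(z))
            w = lam * (lam + z)  # weight in the negative Hessian, always positive
            for a in range(k):
                grad[a] += lam * Ri[a]
                for b in range(k):
                    neg_hess[a][b] += w * Ri[a] * Ri[b]
        step = solve(neg_hess, grad)
        delta = [c + s for c, s in zip(delta, step)]
    return delta

def ols(X, y):
    """Ordinary least squares via the normal equations."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)

# Simulated data; every parameter value here is hypothetical.
random.seed(0)
T = 2000
beta_true, delta_true, rho_true = [1.0, -0.5], [0.3, 0.8, 1.0], 0.6
X, R, y, d = [], [], [], []
for _ in range(T):
    x, w = random.gauss(0, 1), random.gauss(0, 1)  # w enters the probit only
    e2 = random.gauss(0, 1)
    e1 = rho_true * e2 + math.sqrt(1 - rho_true ** 2) * random.gauss(0, 1)
    Ri = [1.0, x, w]
    X.append([1.0, x])
    R.append(Ri)
    y.append(beta_true[0] + beta_true[1] * x + e1)
    d.append(1 if sum(r * c for r, c in zip(Ri, delta_true)) + e2 > 0 else 0)

# Step 1: probit on all T observations to estimate delta.
delta_hat = probit_mle(R, d)

# Step 2: regress observed y_i on X_i and the estimated Mills ratio M_i.
Z, y_obs = [], []
for Xi, Ri, yi, di in zip(X, R, y, d):
    if di:
        z = sum(r * c for r, c in zip(Ri, delta_hat))
        Z.append(Xi + [phi(z) / Phi(z)])
        y_obs.append(yi)
coef = ols(Z, y_obs)
beta_hat, sigma12_hat = coef[:2], coef[2]
print("delta_hat:", delta_hat)
print("beta_hat:", beta_hat, "sigma12_hat:", sigma12_hat)
```

The simulated probit includes a variable that does not appear in the structural equation, in the same way that U5467 appears only in the probit part of Table II; without such a variable the Mills ratio term would be identified only through its nonlinearity. The standard errors reported by an unweighted regression in the second step would not be valid for inference, for the reason given above concerning the heteroscedasticity of $u_i$.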